1 Introduction

Unbanked individuals represent both an underserved demographic as well as a typically untapped market by reputable creditors. Home Credit seeks to fill this gap in service. There are unique challenges that accompany establishing creditworthiness among a population that by definition has little to no financial history, verifiable assets, or traditional means to qualify for a loan.

1.1 Project Scope

This project will utilize machine learning algorithms to develop a classification model which will use available data about Home Credit customers to improve prediction of those that are likely to repay loans granted by Home Credit. The team will test a number of possible classification models in order to develop the most accurate model on data outside the training data. An added benefit of this project is the potential to identify possible additional data that might further inform the classification model. A successful model will provide greater predictive power than a simple prediction based upon majority class statistics and will allow Home Credit to loan to customers with confidence that repayment will in return grow available assets to the company in order to further its mission of providing credit to the underserved.

Below, we explore the dataset and provide some preliminary analysis to determine if the dataset is adequate to move forward with data modelling.

1.2 Load Libraries

library(tidyverse)
library(skimr)
library(scales)
library(ggplot2)
library(corrplot)
library(stringr)
library(reshape2)
library(gridExtra)
library(grid)
library(gmodels)

1.3 Reading train file

#Read application_train.csv file
app_train <- read.csv("../data/application_train.csv")

2 Description of Data

#View structure and summary of data
str(app_train)
## 'data.frame':    307511 obs. of  122 variables:
##  $ SK_ID_CURR                  : int  100002 100003 100004 100006 100007 100008 100009 100010 100011 100012 ...
##  $ TARGET                      : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ NAME_CONTRACT_TYPE          : chr  "Cash loans" "Cash loans" "Revolving loans" "Cash loans" ...
##  $ CODE_GENDER                 : chr  "M" "F" "M" "F" ...
##  $ FLAG_OWN_CAR                : chr  "N" "N" "Y" "N" ...
##  $ FLAG_OWN_REALTY             : chr  "Y" "N" "Y" "Y" ...
##  $ CNT_CHILDREN                : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ AMT_INCOME_TOTAL            : num  202500 270000 67500 135000 121500 ...
##  $ AMT_CREDIT                  : num  406598 1293503 135000 312683 513000 ...
##  $ AMT_ANNUITY                 : num  24701 35699 6750 29687 21866 ...
##  $ AMT_GOODS_PRICE             : num  351000 1129500 135000 297000 513000 ...
##  $ NAME_TYPE_SUITE             : chr  "Unaccompanied" "Family" "Unaccompanied" "Unaccompanied" ...
##  $ NAME_INCOME_TYPE            : chr  "Working" "State servant" "Working" "Working" ...
##  $ NAME_EDUCATION_TYPE         : chr  "Secondary / secondary special" "Higher education" "Secondary / secondary special" "Secondary / secondary special" ...
##  $ NAME_FAMILY_STATUS          : chr  "Single / not married" "Married" "Single / not married" "Civil marriage" ...
##  $ NAME_HOUSING_TYPE           : chr  "House / apartment" "House / apartment" "House / apartment" "House / apartment" ...
##  $ REGION_POPULATION_RELATIVE  : num  0.0188 0.00354 0.01003 0.00802 0.02866 ...
##  $ DAYS_BIRTH                  : int  -9461 -16765 -19046 -19005 -19932 -16941 -13778 -18850 -20099 -14469 ...
##  $ DAYS_EMPLOYED               : int  -637 -1188 -225 -3039 -3038 -1588 -3130 -449 365243 -2019 ...
##  $ DAYS_REGISTRATION           : num  -3648 -1186 -4260 -9833 -4311 ...
##  $ DAYS_ID_PUBLISH             : int  -2120 -291 -2531 -2437 -3458 -477 -619 -2379 -3514 -3992 ...
##  $ OWN_CAR_AGE                 : num  NA NA 26 NA NA NA 17 8 NA NA ...
##  $ FLAG_MOBIL                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_EMP_PHONE              : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ FLAG_WORK_PHONE             : int  0 0 1 0 0 1 0 1 0 0 ...
##  $ FLAG_CONT_MOBILE            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_PHONE                  : int  1 1 1 0 0 1 1 0 0 0 ...
##  $ FLAG_EMAIL                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OCCUPATION_TYPE             : chr  "Laborers" "Core staff" "Laborers" "Laborers" ...
##  $ CNT_FAM_MEMBERS             : num  1 2 1 2 1 2 3 2 2 1 ...
##  $ REGION_RATING_CLIENT        : int  2 1 2 2 2 2 2 3 2 2 ...
##  $ REGION_RATING_CLIENT_W_CITY : int  2 1 2 2 2 2 2 3 2 2 ...
##  $ WEEKDAY_APPR_PROCESS_START  : chr  "WEDNESDAY" "MONDAY" "MONDAY" "WEDNESDAY" ...
##  $ HOUR_APPR_PROCESS_START     : int  10 11 9 17 11 16 16 16 14 8 ...
##  $ REG_REGION_NOT_LIVE_REGION  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_REGION_NOT_WORK_REGION  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LIVE_REGION_NOT_WORK_REGION : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_CITY_NOT_LIVE_CITY      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_CITY_NOT_WORK_CITY      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ LIVE_CITY_NOT_WORK_CITY     : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ORGANIZATION_TYPE           : chr  "Business Entity Type 3" "School" "Government" "Business Entity Type 3" ...
##  $ EXT_SOURCE_1                : num  0.083 0.311 NA NA NA ...
##  $ EXT_SOURCE_2                : num  0.263 0.622 0.556 0.65 0.323 ...
##  $ EXT_SOURCE_3                : num  0.139 NA 0.73 NA NA ...
##  $ APARTMENTS_AVG              : num  0.0247 0.0959 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_AVG            : num  0.0369 0.0529 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_AVG : num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_AVG             : num  0.619 0.796 NA NA NA ...
##  $ COMMONAREA_AVG              : num  0.0143 0.0605 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_AVG               : num  0 0.08 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_AVG               : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_AVG               : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_AVG               : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_AVG                : num  0.0369 0.013 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_AVG        : num  0.0202 0.0773 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_AVG              : num  0.019 0.0549 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_AVG     : num  0 0.0039 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_AVG           : num  0 0.0098 NA NA NA NA NA NA NA NA ...
##  $ APARTMENTS_MODE             : num  0.0252 0.0924 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_MODE           : num  0.0383 0.0538 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_MODE: num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_MODE            : num  0.634 0.804 NA NA NA ...
##  $ COMMONAREA_MODE             : num  0.0144 0.0497 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_MODE              : num  0 0.0806 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_MODE              : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_MODE              : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_MODE              : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_MODE               : num  0.0377 0.0128 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_MODE       : num  0.022 0.079 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_MODE             : num  0.0198 0.0554 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_MODE    : num  0 0 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_MODE          : num  0 0 NA NA NA NA NA NA NA NA ...
##  $ APARTMENTS_MEDI             : num  0.025 0.0968 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_MEDI           : num  0.0369 0.0529 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_MEDI: num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_MEDI            : num  0.624 0.799 NA NA NA ...
##  $ COMMONAREA_MEDI             : num  0.0144 0.0608 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_MEDI              : num  0 0.08 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_MEDI              : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_MEDI              : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_MEDI              : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_MEDI               : num  0.0375 0.0132 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_MEDI       : num  0.0205 0.0787 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_MEDI             : num  0.0193 0.0558 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_MEDI    : num  0 0.0039 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_MEDI          : num  0 0.01 NA NA NA NA NA NA NA NA ...
##  $ FONDKAPREMONT_MODE          : chr  "reg oper account" "reg oper account" "" "" ...
##  $ HOUSETYPE_MODE              : chr  "block of flats" "block of flats" "" "" ...
##  $ TOTALAREA_MODE              : num  0.0149 0.0714 NA NA NA NA NA NA NA NA ...
##  $ WALLSMATERIAL_MODE          : chr  "Stone, brick" "Block" "" "" ...
##  $ EMERGENCYSTATE_MODE         : chr  "No" "No" "" "" ...
##  $ OBS_30_CNT_SOCIAL_CIRCLE    : num  2 1 0 2 0 0 1 2 1 2 ...
##  $ DEF_30_CNT_SOCIAL_CIRCLE    : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ OBS_60_CNT_SOCIAL_CIRCLE    : num  2 1 0 2 0 0 1 2 1 2 ...
##  $ DEF_60_CNT_SOCIAL_CIRCLE    : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ DAYS_LAST_PHONE_CHANGE      : num  -1134 -828 -815 -617 -1106 ...
##  $ FLAG_DOCUMENT_2             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FLAG_DOCUMENT_3             : int  1 1 0 1 0 1 0 1 1 0 ...
##  $ FLAG_DOCUMENT_4             : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
summary(app_train$TARGET)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08073 0.00000 1.00000

There are 307511 rows and 122 features including TARGET variable.The target is categorical variable classified as “0” for the loan was repaid on time with no difficulties and “1” indicating the client had payment difficulties.

3 Target Value Analysis

3.1 Missing Value check

#Checking missing value in target variable
missing_count <- sum(is.na(app_train$TARGET))
missing_count
## [1] 0

There are no missing values in Target variable

3.2 Check count distribution of target variable

# View count of clients in each class
table(app_train$TARGET)
## 
##      0      1 
## 282686  24825

Here 24825 clients are observed to have payment difficulties, whereas 282686 people have no payment difficulties.

3.3 Visualize count distribution of target variable

library(ggplot2)
ggplot(app_train, aes(x = as.factor(TARGET))) +
  geom_bar(fill = "steelblue") +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, color = "black", size = 3) +
  labs(x = "Target Variable", y = "Count") +
  ggtitle("Distribution of the Target Variable") +
  scale_y_continuous(breaks = seq(0, max(table(app_train$TARGET)), by = 50000))

Above Target variable distribution shows that majority clients come under no payment difficulties class(0), only 24825 clients have payment difficulties.

# View the proportion of target variable
proportion <- prop.table(table(app_train$TARGET)) * 100
proportion
## 
##         0         1 
## 91.927118  8.072882

91.92% of the clients are observed to have no payment difficulties, 8.07% clients face payment difficulties.The data looks unbalaced with respect target variable as proportion for class 0 is much higher (91.92%) than that of class 1 (8.07%). Special attention should also be given to model Sensitivity and Specificity.

3.4 Visualize the proportion of target variable

# Calculate the proportion and convert it to percentages
proportion <- prop.table(table(app_train$TARGET)) * 100

# Create a data frame for plotting
pie_data <- data.frame(category = names(proportion), proportion = proportion)

# Create the pie chart using ggplot
pie_chart <- ggplot(pie_data, aes(x = "", y = proportion, fill = category)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  theme_void() +
  labs(fill = "Category") +
  geom_text(aes(label = paste0(round(proportion, 2), "%")), position = position_stack(vjust = 0.5)) +
  labs(title = "Proportion of TARGET Categories")

# Display the pie chart
print(pie_chart)

3.5 Determine accuracy for majority class classifier

# Count of majority class
majority_count <- max(table(app_train$TARGET))

# Total number of clients in the target class
total_count <- length(app_train$TARGET)

# Accuracy for majority class
accuracy <- majority_count / total_count

accuracy
## [1] 0.9192712

The accuracy for majority class classifier “0” (client with no payment difficulties) is 91.92% which is same as the proportion of class “0” clients in the data.

4 Correlation Analysis

4.1 Correlation of Target and Numeric Predictors

#Getting list of Top5 positive and top 5 negative predictors

# Separate numeric predictors
numeric_vars <- app_train %>% 
                select_if(is.numeric)

# Calculate correlation between numeric predictors and target variable
cor_target <- cor(numeric_vars, numeric_vars$TARGET, use = "pairwise.complete.obs", method = "pearson")
as.data.frame(cor_target) %>% rename('Correlation Coefficient'=V1)
# Create a data frame with predictor names and correlations
df <- data.frame(Target = "TARGET", Variable = colnames(numeric_vars), Correlation = cor_target)

# Sort the data frame by correlation in descending order
sorted_df <- df[order(-abs(df$Correlation)), , drop = FALSE]

# Extract the top 10 positive and top 10 negative correlations
top_pos <- head(sorted_df[sorted_df$Correlation > 0, ], 10)
top_neg <- head(sorted_df[sorted_df$Correlation < 0, ], 10)
##Visualizing correlation using heatmap for numerical predictors with target

# Convert the correlation matrix to a data frame
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
cor_df <- melt(cor_matrix)

# Rename the columns
colnames(cor_df) <- c("Variable1", "Variable2", "Correlation")

# Sort the correlations by the absolute value
cor_df <- cor_df[order(-abs(cor_df$Correlation)), ]

# Filter the correlations involving the target variable
cor_target <- subset(cor_df, Variable1 == "TARGET" | Variable2 == "TARGET")

# Select the top positive and negative correlations
top_pos <- head(subset(cor_target, Correlation > 0), 5)
top_neg <- head(subset(cor_target, Correlation < 0), 5)

# Combine the top positive and negative correlations
top_cor <- rbind(top_pos, top_neg)

# Filter the correlations involving the top positive and negative predictors
top_predictors <- c(top_pos$Variable1, top_pos$Variable2, top_neg$Variable1, top_neg$Variable2)
cor_top_predictors <- subset(cor_df, Variable1 %in% top_predictors & Variable2 %in% top_predictors)

# Visualize the correlations using a heatmap
ggplot(cor_top_predictors, aes(x = Variable1, y = Variable2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  geom_text(aes(label = round(Correlation, 2)), color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap of Top Predictors")

DAYS_BIRTH, REGION_RATING_CLIENT_W_CITY, REGION_RATING_CLIENT are the top 3 numerical predictors with positive correlation wrt TARGET, whereas EXT_SOURCE_3, EXT_SOURCE_2, EXT_SOURCE_1 are top 3 numerical predictors with negative correlation wrt TARGET.

4.2 Correlation of Target and Categorical Predictors

# Select character variables
character_vars <- app_train %>% 
                  select_if(is.character)

# Create a data frame to store correlation values
cor_df <- data.frame(variable = character(), correlation = numeric(), stringsAsFactors = FALSE)

# Calculate correlation with the target variable
for (var_name in colnames(character_vars)) {
  var <- character_vars[[var_name]]
  correlation <- cor(as.numeric(as.factor(var)), app_train$TARGET, use = "pairwise.complete.obs")
  cor_df <- cor_df %>% add_row(variable = var_name, correlation = correlation)
}
## Listing top categorical predictors

# Sort the data frame by correlation in descending order
sorted_df <- cor_df %>% arrange(desc(correlation))

# Select the top 5 categorical predictors
top_categorical_predictors <- head(sorted_df, 10)

# Print the top 5 categorical predictors
print(top_categorical_predictors)
##                      variable  correlation
## 1         NAME_EDUCATION_TYPE  0.054698602
## 2                 CODE_GENDER  0.054692262
## 3            NAME_INCOME_TYPE  0.046829434
## 4             OCCUPATION_TYPE  0.041419035
## 5           NAME_HOUSING_TYPE  0.034488603
## 6             NAME_TYPE_SUITE  0.009692820
## 7  WEEKDAY_APPR_PROCESS_START  0.004001745
## 8          NAME_FAMILY_STATUS -0.004127226
## 9             FLAG_OWN_REALTY -0.006148388
## 10               FLAG_OWN_CAR -0.021850938
# Reorder the levels of the categorical predictor variable
top_categorical_predictors$variable <- factor(top_categorical_predictors$variable,
                                             levels = top_categorical_predictors$variable[order(top_categorical_predictors$correlation)])
##Visualizing categorical predictors with Target

# Create a bar plot of the top categorical predictors
ggplot(top_categorical_predictors, aes(x = variable, y = correlation, label = round(correlation, 2))) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(vjust = -0.5) +
  labs(x = "Categorical Predictor", y = "Correlation") +
  ggtitle("Top Categorical Predictors with Target Variable") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  scale_fill_manual(values = c("steelblue" = "steelblue"), guide = "none")

NAME_EDUCATION_TYPE, CODE_GENDER, NAME_INCOME_TYPE are the top 3 categorical predictors which has positive correlation wrt TARGET.

5 Explore Bureau Data

Join application_{train|test}.csv with transactional data in, for example, bureau.csv or previous_application.csv. This will require aggregating the transactional data to have the same grain as the application data.

data <- read.csv('../data/application_train.csv')
data2 <- read.csv('../data/bureau.csv')
head(data)
summary(data)
head(data2)
summary(data2)

We can average together a select number of columns to condense the bureau table down to the client level from the credit level, which will help us join it to our application data (also at the client level).

data2 <- group_by(data2,SK_ID_CURR) %>%
  summarize(avg_days_credit = mean(DAYS_CREDIT),
            avg_credit_day_overdue = mean(CREDIT_DAY_OVERDUE),
            avg_days_credit_enddate = mean(DAYS_CREDIT_ENDDATE),
            avg_amt_credit_max_overdue = mean(AMT_CREDIT_MAX_OVERDUE),
            avg_cnt_credit_prolong = mean(CNT_CREDIT_PROLONG),
            avg_amt_credit_sum = mean(AMT_CREDIT_SUM),
            avg_amt_credit_sum_debt = mean(AMT_CREDIT_SUM_DEBT),
            avg_amt_credit_sum_limit = mean(AMT_CREDIT_SUM_LIMIT),
            avg_amt_credit_sum_overdue = mean(AMT_CREDIT_SUM_OVERDUE))
head(data2)
summary(data2)
##    SK_ID_CURR     avg_days_credit   avg_credit_day_overdue avg_days_credit_enddate avg_amt_credit_max_overdue avg_cnt_credit_prolong avg_amt_credit_sum  avg_amt_credit_sum_debt avg_amt_credit_sum_limit
##  Min.   :100001   Min.   :-2922.0   Min.   :   0.0000      Min.   :-41875.0        Min.   :       0           Min.   :0.00000        Min.   :        0   Min.   :-1083615        Min.   : -49014         
##  1st Qu.:188879   1st Qu.:-1431.0   1st Qu.:   0.0000      1st Qu.:  -718.0        1st Qu.:       0           1st Qu.:0.00000        1st Qu.:   103960   1st Qu.:       0        1st Qu.:      0         
##  Median :277895   Median :-1052.5   Median :   0.0000      Median :  -137.7        Median :       0           Median :0.00000        Median :   197287   Median :   43190        Median :      0         
##  Mean   :278047   Mean   :-1083.8   Mean   :   0.9659      Mean   :   597.8        Mean   :    7559           Mean   :0.00701        Mean   :   380721   Mean   :  169017        Mean   :   5167         
##  3rd Qu.:367185   3rd Qu.: -670.4   3rd Qu.:   0.0000      3rd Qu.:   569.0        3rd Qu.:    2037           3rd Qu.:0.00000        3rd Qu.:   397831   3rd Qu.:  145816        3rd Qu.:      0         
##  Max.   :456255   Max.   :    0.0   Max.   :2776.0000      Max.   : 31198.0        Max.   :47406123           Max.   :6.00000        Max.   :198072344   Max.   :51750000        Max.   :1755000         
##                                                            NA's   :76830           NA's   :275677                                    NA's   :13          NA's   :126580          NA's   :208930          
##  avg_amt_credit_sum_overdue
##  Min.   :      0           
##  1st Qu.:      0           
##  Median :      0           
##  Mean   :     46           
##  3rd Qu.:      0           
##  Max.   :1617404           
## 

Explore the joined transactional data. Do some of the added columns show promise in predicting default?

data2 %>%
  select(everything()) %>%
  summarise_all(funs(sum(is.na(.))))
skm <- skim(data2)
# Create DF showing Columns with missing data
skm_missing <- select(skm, skim_variable, n_missing, complete_rate) %>% filter(n_missing > 0) %>% arrange(complete_rate)
skm_missing

Some of these columns definitely show promise. There are 9 additional dimensions and only 5 of them have missing values. Of those 5 missing values, only 2 are missing more than half of their values. Only 2-3 of the 9 additional dimensions are missing substantial data.

5.1 Explore Joined Data

data3 <- left_join(data, data2, by = 'SK_ID_CURR')
head(data3)
summary(data3)

Now that the data is joined, let’s see how many records we could add additional dimensions to.

data3 %>%
  select(everything()) %>%
  summarise_all(funs(sum(is.na(.))))
skm2 <- skim(data3)
# Create DF showing Columns with missing data
skm_missing2 <- select(skm2, skim_variable, n_missing, complete_rate) %>% filter(n_missing > 0) %>% arrange(complete_rate)
skm_missing2

Avg_amt_credit_max_overdue only fills in about 9% of records and avg_amt_credit_sum_limit only matches about 28%. Those two variables may have to be thrown out, but all of the rest of them have matches of 50% or more. Many of the variables match a whopping 86% of records. This data is definitely usable and could be of value with the use of a model.

6 Missing Data

data <- read.csv('../data/application_train.csv')

The dataset is highly dimensional and has 121 independent variables listed. It is important to understand the degree of completeness of these variables in order to determine their usefulness in the model building phase of the project.

skm <- skim(data)
# Create DF showing Columns with missing data
skm_missing <- select(skm, skim_variable, n_missing, complete_rate) %>% filter(n_missing > 0) %>% arrange(complete_rate)
skm_missing

Of the 121 independent variables, it turns out half (61 rows) have missing data. Before determining what to do about each of these variables (removal, imputation, etc.) it is important to examine the columns and have a better understanding of what each of these columns may contribute in terms of useful information towards solving the eventual question of whether or not the customer is likely to default on the loan.

6.1 Character Columns

skm_char <- skm %>% filter(skim_type == 'character') %>% select(skim_variable,character.min,character.max,character.empty) %>% filter(character.empty > 0)
skm_char

Six of the 16 character variables contain missing information. Of those 6 columns, only 2 contain information that is informative towards the likelihood of customer default, NAME_TYPE_SUITE (NA = 1292) and OCCUPATION_TYPE (NA=96391). In both of these cases it is impossible know whether the NA signifies the lack of a value (i.e. unhoused or unemployed) of if the information is just incomplete. Therefore it is proposed to fill all NAs in both of this columns with a new value ‘unknown’. The other 4 columns will be dropped in light of the large number of missing values, and the small amount of variance in the values, in addition to the lack of relevance to the target variable outcome.

6.2 Numeric Housing Data

Of the 61 variables that have missing data, 47 of these variables describe the housing of applicant in significant detail. While an understanding of the housing situation is likely to be informative of the customer’s default risk, the detail provided in these columns is probably not necessary, much of the important information is included in NAME_HOUSING_TYPE and FLAG_OWN_REALTY which are both complete. It may be informative to disaggregate those that live in a house as opposed to an apartment, and we can look further into this possibility. However, inclusion of these 47 variables in the final model is probably unnecessary and so missing values in these columns can be ignored.

housing_stats_cols <- c('APARTMENTS_AVG','BASEMENTAREA_AVG','YEARS_BEGINEXPLUATATION_AVG','YEARS_BUILD_AVG','COMMONAREA_AVG',
                        'ELEVATORS_AVG','ENTRANCES_AVG','FLOORSMAX_AVG','FLOORSMIN_AVG','LANDAREA_AVG','LIVINGAPARTMENTS_AVG',
                        'LIVINGAREA_AVG','NONLIVINGAPARTMENTS_AVG','NONLIVINGAREA_AVG','APARTMENTS_MODE','BASEMENTAREA_MODE',
                        'YEARS_BEGINEXPLUATATION_MODE','YEARS_BUILD_MODE','COMMONAREA_MODE','ELEVATORS_MODE','ENTRANCES_MODE',
                        'FLOORSMAX_MODE','FLOORSMIN_MODE','LANDAREA_MODE','LIVINGAPARTMENTS_MODE','LIVINGAREA_MODE',
                        'NONLIVINGAPARTMENTS_MODE','NONLIVINGAREA_MODE','APARTMENTS_MEDI','BASEMENTAREA_MEDI',
                        'YEARS_BEGINEXPLUATATION_MEDI','YEARS_BUILD_MEDI','COMMONAREA_MEDI','ELEVATORS_MEDI','ENTRANCES_MEDI',
                        'FLOORSMAX_MEDI','FLOORSMIN_MEDI','LANDAREA_MEDI','LIVINGAPARTMENTS_MEDI','LIVINGAREA_MEDI',
                        'NONLIVINGAPARTMENTS_MEDI','NONLIVINGAREA_MEDI','FONDKAPREMONT_MODE','HOUSETYPE_MODE','TOTALAREA_MODE',
                        'WALLSMATERIAL_MODE','EMERGENCYSTATE_MODE')
# Remove housing data columms
skm_missing %>% filter(!skim_variable %in% housing_stats_cols)

6.3 Social Circle Data

Looking at the remaining columns with missing data 4 of them contain information about the applicant’s social circle and the 30-day and 60-day days past due default status, there is minimal missing data (n=1021)

social_data_df <- data %>% select(OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,
                                  DEF_60_CNT_SOCIAL_CIRCLE)
summary(social_data_df)
##  OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE
##  Min.   :  0.000          Min.   : 0.0000          Min.   :  0.000          Min.   : 0.0            
##  1st Qu.:  0.000          1st Qu.: 0.0000          1st Qu.:  0.000          1st Qu.: 0.0            
##  Median :  0.000          Median : 0.0000          Median :  0.000          Median : 0.0            
##  Mean   :  1.422          Mean   : 0.1434          Mean   :  1.405          Mean   : 0.1            
##  3rd Qu.:  2.000          3rd Qu.: 0.0000          3rd Qu.:  2.000          3rd Qu.: 0.0            
##  Max.   :348.000          Max.   :34.0000          Max.   :344.000          Max.   :24.0            
##  NA's   :1021             NA's   :1021             NA's   :1021             NA's   :1021
plot_OBS_30 <- ggplot(social_data_df, aes(OBS_30_CNT_SOCIAL_CIRCLE)) +
  geom_histogram()
plot_DEF_30 <- ggplot(social_data_df, aes(DEF_30_CNT_SOCIAL_CIRCLE)) +
  geom_histogram()
plot_OBS_60 <- ggplot(social_data_df, aes(OBS_60_CNT_SOCIAL_CIRCLE)) +
  geom_histogram()
plot_DEF_60 <- ggplot(social_data_df, aes(DEF_60_CNT_SOCIAL_CIRCLE)) +
  geom_histogram()

grid.arrange(plot_OBS_30,plot_DEF_30,plot_OBS_60,plot_DEF_60, ncol=2, top=textGrob('Social Circle Variable Distribution'))

Given that there are relatively few rows with missing data and the data is heavily right-skewed, it is proposed that missing data in these columns be replaced with imputed zero values (median= 0),

6.4 External Source Data

Included in the dataset are 3 externally sourced scores which have been normalized. There is wide variation in the number of values missing in each of these sources (\(NA_1=173378\),\(NA_2=660\),\(NA_3=60965\)).

summary(data %>% select(EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3))
##   EXT_SOURCE_1     EXT_SOURCE_2     EXT_SOURCE_3  
##  Min.   :0.01     Min.   :0.0000   Min.   :0.00   
##  1st Qu.:0.33     1st Qu.:0.3925   1st Qu.:0.37   
##  Median :0.51     Median :0.5660   Median :0.54   
##  Mean   :0.50     Mean   :0.5144   Mean   :0.51   
##  3rd Qu.:0.68     3rd Qu.:0.6636   3rd Qu.:0.67   
##  Max.   :0.96     Max.   :0.8550   Max.   :0.90   
##  NA's   :173378   NA's   :660      NA's   :60965
# Select rows with no External Source Data
n_empty_ext <- nrow(data %>% filter(is.na(EXT_SOURCE_1) & is.na(EXT_SOURCE_2) & is.na(EXT_SOURCE_3)))

Removing rows with missing data would shrink the dataset considerably. Imputation with the mean value for each column is a possibility. Another option is to create a new variable combining the scores into an AVG_EXT_SCORE since they are all normalized values. The benefit to this approach is that only 172 rows have no value for any of the 3 columns and so very little data would need to be imputed or removed. This approach takes advantage of available information, while reducing missing data. However it is possible that some information loss will occur with this approach if one of these sources turns out to be much more informative than one or both of the others.

6.5 Credit Bureau Data

The final large group of related data deals with credit inquiries from the credit bureau over the previous year. These are divided into the hour, day, week, month, quarter, and year (each time period is exclusive of the time period before it) prior to the application being submitted. This data is strongly right-skewed and is 86.5% complete (NAs = 41519).

summary(data %>% select(AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,
                        AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR))
##  AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
##  Min.   :0.00               Min.   :0.00              Min.   :0.00               Min.   : 0.00             Min.   :  0.00            Min.   : 0.0              
##  1st Qu.:0.00               1st Qu.:0.00              1st Qu.:0.00               1st Qu.: 0.00             1st Qu.:  0.00            1st Qu.: 0.0              
##  Median :0.00               Median :0.00              Median :0.00               Median : 0.00             Median :  0.00            Median : 1.0              
##  Mean   :0.01               Mean   :0.01              Mean   :0.03               Mean   : 0.27             Mean   :  0.27            Mean   : 1.9              
##  3rd Qu.:0.00               3rd Qu.:0.00              3rd Qu.:0.00               3rd Qu.: 0.00             3rd Qu.:  0.00            3rd Qu.: 3.0              
##  Max.   :4.00               Max.   :9.00              Max.   :8.00               Max.   :27.00             Max.   :261.00            Max.   :25.0              
##  NA's   :41519              NA's   :41519             NA's   :41519              NA's   :41519             NA's   :41519             NA's   :41519

While we could remove the 41519 rows missing data, the loss of other information in those records might be valuable, and so it is proposed to impute the missing values with the median value for each column, in most cases this is 0.

6.6 Other Data

There are a few remaining columns that contain missing data OWN_CAR_AGE, AMT_GOODS_PRICE, AMT_ANNUITY, CNT_FAM_MEMBERS,and DAYS_LAST_PHONE_CHANGE. Of these columns only OWN_CAR_AGE has a significant number of missing values. There is another column which is related to car ownership, FLAG_OWN_CAR, which is a boolean field. Upon closer inspection it is evident that all of those records where FLAG_OWN_CAR = ‘N’ contain NAs, conversely only 5 records with a FLAG_OWN_CAR = ‘Y’ contain missing values.

CrossTable(data$FLAG_OWN_CAR, is.na(data$OWN_CAR_AGE), prop.r = F, prop.c = F, prop.chisq = F, dnn = c('Owns Car','Contains Missing Values'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  307511 
## 
##  
##              | Contains Missing Values 
##     Owns Car |     FALSE |      TRUE | Row Total | 
## -------------|-----------|-----------|-----------|
##            N |         0 |    202924 |    202924 | 
##              |     0.000 |     0.660 |           | 
## -------------|-----------|-----------|-----------|
##            Y |    104582 |         5 |    104587 | 
##              |     0.340 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |    104582 |    202929 |    307511 | 
## -------------|-----------|-----------|-----------|
## 
## 

Therefore, it is proposed to remove these 5 records, and to convert the OWN_CAR_AGE into a factor variable with levels: No Car Owned, 0-3 years, 4-6 years, etc.

The other columns contain so few observations that it is proposed that those rows with missing values be excluded.

7 Problems with the data

7.1 Outlier Detection

skm_num <- skm %>% filter(skim_type == 'numeric') %>% filter(!skim_variable %in% housing_stats_cols) %>% select(skim_variable, numeric.p0, numeric.mean, numeric.p100, numeric.hist)
skm_num

Some modelling techniques are susceptible to outsized influence by outliers (regression for example). After examining the distribution of the numeric data a few columns stand out as containing likely outliers. The first examples are CNT_FAM_MEMBERS and CNT_CHILDREN.

# Create dataframe with columns of concern for plotting
num_problem_df <- data %>% select(CNT_CHILDREN, CNT_FAM_MEMBERS, AMT_INCOME_TOTAL, OWN_CAR_AGE, DAYS_LAST_PHONE_CHANGE, DAYS_EMPLOYED) %>% filter(!is.na(CNT_FAM_MEMBERS))

# Create Plots for family columns
plt_fam <- ggplot(num_problem_df, aes(CNT_FAM_MEMBERS)) +
  geom_boxplot()
plt_chld <- ggplot(num_problem_df, aes(CNT_CHILDREN)) +
  geom_boxplot()

grid.arrange(plt_fam, plt_chld, ncol=2, top=textGrob('Family Count Variables Distribution'))

Both of these columns (which are related) contain a small number of outlier records, however the information is consistent and is likely to have an influence on the target variable. As such, it is recommended that this variable is normalized before it is used in any modelling techniques that are suceptible to the presence of outliers.

8 Basic inspection of application train data

8.1 Scope of missing values

8.1.1 Rowwise completeness

app_train %>%
  mutate(n_missing = rowSums(is.na(.)),
         p_missing = n_missing/ncol(.)) %>%
  ggplot() +
  geom_histogram(aes(p_missing),
                 binwidth = 0.05, fill = "darkred", color = "white") +
  stat_bin(aes(p_missing, y = after_stat(count), label = ifelse(after_stat(count) == 0, "", after_stat(count))),
           geom = "text", binwidth = 0.05, size = 6, fontface = "bold", vjust = 0) +
  scale_x_continuous(breaks = seq(0,1,0.1), minor_breaks = NULL) +
  scale_y_continuous(labels = ~paste0(.%/%1000, "k")) +
  labs(title = "Distribution of missing values by row",
       x = "percent missing") +
  theme_minimal()

8.1.2 Columns with most NA’s

app_train %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to = "col",
               values_to = "n_missing") %>%
  mutate(p_missing = percent(n_missing / nrow(app_train), 0.1)) %>%
  arrange(-n_missing) %>%
  head(20)

8.2 Inspecting numerical variables

8.2.1 Class conversion

There are many categorical variables in this table which are encoded numerically. With 122 total columns, it’s rather arduous to distinguish true numerical variables but selecting columns that only contain numbers (and NA’s) is an obvious start.

# Ensuring data types were properly attributed during import
app_train %>%
  summarise(across(everything(), 
                   list(assigned = class,
                        observed = \(xx){
                          case_when(grepl("^(\\d|\\.|-)+$", xx) | is.na(xx) ~ "numeric", 
                                    .default = "character")
                        },
                        vsample = ~as.character(.))) %>%
            first) %>%
  pivot_longer(everything(),
               names_to = c("column", ".value"),
               names_pattern = "(.*)_(.*)") %>%
  filter(assigned != observed) # subset mismatches

Since no columns where misclassed during the importing process, we can move on to subsetting the numerical columns.

app_train %>%
  select(where(is.numeric))

106/122 (87%) of columns in the train set are classified as numeric so this subsetting didn’t do much for us. It’s unlikely that categorical variables would be encoded with decimals/doubles or negative numbers, so we can remove these for now while we look through the rest of the variables’ description in the column description file.

app_train %>%
  select(where(is.numeric)) %>%
  mutate(across(everything(), as.character)) %>%
  select(where(~ all(!grepl("\\.|-", .)))) %>%
  colnames() %>%
  tibble(column = .)

This brings us down to a much more manageable 50 columns. Further determination can be aided by the column description file. Once all categorical variables have been identified, they can be converted to factors. The new factors can easily be reconverted later if need be.

num2fac <- app_train %>%
  select(where(is.numeric)) %>%
  mutate(across(everything(), as.character)) %>%
  select(where(~ all(!grepl("\\.|-", .)))) %>%
  select(-c(own_car_age, hour_appr_process_start, matches("^(obs|def|amt|cnt)"))) %>%
  colnames() 

app_train1 <- app_train %>%
  mutate(across(c(where(is.character), all_of(num2fac)), factor)) 

8.2.2 Distribution and other characteristics

app_train1 %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), 
                   list(avg = ~mean(., na.rm = TRUE),
                        med = ~median(., na.rm = TRUE),
                        max = ~max(., na.rm = TRUE),
                        min = ~min(., na.rm = TRUE),
                        sd = ~sd(., na.rm = TRUE),
                        var = ~var(., na.rm = TRUE),
                        outupper = ~soutlier(.),
                        outlower = ~soutlier(., lower = TRUE))) %>%
              round(2)) %>%
  pivot_longer(everything(),
               names_to = c("column", ".value"),
               names_pattern = "(.*)_(.*)")

9 Addressing problematic columns

Some missing values in code_gender and organization_type were encoded with the string XNA instead of being explicitly missing.

unique(app_train1$code_gender)
## [1] M   F   XNA
## Levels: F M XNA
unique(app_train1$organization_type)
##  [1] Business Entity Type 3 School                 Government             Religion               Other                  XNA                    Electricity            Medicine               Business Entity Type 2
## [10] Self-employed          Transport: type 2      Construction           Housing                Kindergarten           Trade: type 7          Industry: type 11      Military               Services              
## [19] Security Ministries    Transport: type 4      Industry: type 1       Emergency              Security               Trade: type 2          University             Transport: type 3      Police                
## [28] Business Entity Type 1 Postal                 Industry: type 4       Agriculture            Restaurant             Culture                Hotel                  Industry: type 7       Trade: type 3         
## [37] Industry: type 3       Bank                   Industry: type 9       Insurance              Trade: type 6          Industry: type 2       Transport: type 1      Industry: type 12      Mobile                
## [46] Trade: type 1          Industry: type 5       Industry: type 10      Legal Services         Advertising            Trade: type 5          Cleaning               Industry: type 13      Trade: type 4         
## [55] Telecom                Industry: type 8       Realtor                Industry: type 6      
## 58 Levels: Advertising Agriculture Bank Business Entity Type 1 Business Entity Type 2 Business Entity Type 3 Cleaning Construction Culture Electricity Emergency Government Hotel Housing ... XNA

This can be fixed with the following:

app_train1 %>%
  select(code_gender, organization_type) %>%
  mutate(across(c(code_gender, organization_type), 
                ~case_when(. != "XNA" ~ .))) %>% # Fix occurs here
  sapply(., unique) %>% # Displaying the results 
  tibble(cols = names(.),
         val = .) %>%
  unnest(val) %>%
  arrange()

days_employed contains some erroneous values. Most values of days_employed are negative and yield reasonable values when converted to positive years with \(\left( x \div 365 \right) \times -1\) but those beginning > 0 equate to less than -1,000 years. What’s more is that all of the values greater than 0 equal 365243. It’s possible that this number may be used as a factor to calculate time or define a time interval. 365 (days) 24 (hours) 3 (?) is rather coincidental.

# Maximum is extremely large and mean is greater than 3rd quartile
app_train1$days_employed %>%
  summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -17912   -2760   -1213   63815    -289  365243
# All values > 0 equal 365243
app_train1 %>%
  filter(days_employed > 0) %>%
  select(days_employed) %>%
  summary
##  days_employed   
##  Min.   :365243  
##  1st Qu.:365243  
##  Median :365243  
##  Mean   :365243  
##  3rd Qu.:365243  
##  Max.   :365243
# days_employed converted to years
app_train1 %>%
  select(days_employed) %>%
  mutate(year_conv = round((days_employed / 365) * -1, 2)) %>%
  head(10)

This can be solved with the following:

app_train1 %>%
  mutate(days_employed = case_when(days_employed <= 0 ~ days_employed))

Time to repair all 3 variables, and update the data frame.

app_train2 <- app_train1 %>%
  mutate(across(c(code_gender, organization_type), 
                ~case_when(. != "XNA" ~ .)),
         days_employed = case_when(days_employed <= 0 ~ days_employed))

9.1 Erroneous Data

In the case of AMT_INCOME_TOTAL there is a single outlier that lies so far out of the scale (AMT_INCOME_TOTAL = 117,000,000) and is inconsistent with other data, and so is probably erroneous. Even if not erroneous, the value added by this datapoint to any model is negligible as the core customer group is not multi-millionaires.

plt_inc <- ggplot(num_problem_df, aes(AMT_INCOME_TOTAL)) +
  geom_boxplot() +
  ggtitle('Total Income Distribution')

# data %>% filter(AMT_INCOME_TOTAL > 100000000)
plt_inc2 <- ggplot(data %>% filter(AMT_INCOME_TOTAL < 100000000), aes(AMT_INCOME_TOTAL)) +
  geom_boxplot() +
  ggtitle('Total Income Distribution (After Removal)')

grid.arrange(plt_inc,plt_inc2, ncol=2, top=textGrob('Total Income Variable Distribution'))

Even after removal, outliers remain and measures (see above) will have to be taken to ensure they do not overly influence predictive models. In the case of such strongly skewed data, it might be advisable to convert the continuous variable into a categorical one and break the data into bins for use in models.

ggplot(num_problem_df, aes(OWN_CAR_AGE)) +
  geom_histogram(bins = 60) +
  ggtitle('Age of Cars Owned Distribution')

data %>% filter(OWN_CAR_AGE > 55) %>% select(SK_ID_CURR, OWN_CAR_AGE) %>% group_by('Cars older than 55 years' = as.factor(OWN_CAR_AGE))%>%count()

There is an odd anomaly in the data such that a large number of owned cars are recorded as being 64-65 years old. It seems unlikely that this data correctly represents reality. The amount of erroneous data is small and so removal wouldn’t impact the overall dataset to a large degree. However, the other data contained in these records may be informative, and so it is proposed that if the variable OWN_CAR_AGE is to be used in a model, that these rows be removed.

ggplot(num_problem_df, aes(DAYS_LAST_PHONE_CHANGE)) +
  geom_histogram() +
  ggtitle('Days Since Previous Phone Change')

data %>% filter(DAYS_EMPLOYED > 0) %>% select(SK_ID_CURR, DAYS_EMPLOYED) %>% group_by('Days Employed' = as.factor(DAYS_EMPLOYED))%>%count()
ggplot(num_problem_df %>% filter(DAYS_EMPLOYED < 0), aes(DAYS_EMPLOYED)) +
  geom_histogram() +
  ggtitle('Days Employed')

de_df <- data %>% filter(DAYS_EMPLOYED < 0) 

Finally, DAYS_EMPLOYED as discussed above, has significant data integrity issues. The way the data is set up only values less than or equal to 0 should be permissible. However, a significant number of rows (n = 55374) all have the same positive value, DAYS_EMPLOYED = 365243. Not only should the value be negative, but the scale is also outside of realistic values (365243 days = 1000+ years). Due to these rows comprising a significant portion of the data set (18%) removal would be detrimental to model accuracy, so it is proposed that the data be imputed for these rows, being replaced by the column median (median = -1648)

10 Results

11 Contributions